+++
title = 'Matrix models'
template = 'page-math.html'
+++
# Matrix models

## Recommender systems
Using the example of movie recommendations for users.

Recommendation using only explicit info:
- we have no representation for users or movies, only 'atomic' objects that we want to compare for similarity
- we saw this with word embeddings: represent each word by its own vector and optimise the values of the vectors

Embedding models:
- train a length-k embedding for each user and one for each movie, and arrange them into matrices U and M.
- the matrices are the parameters of the model

![9277af676d9256fb745e41d8cf59dcd1.png](caf202f13ca04b06a4a3d1e9e8a3c702.png)

To make a prediction, define that the dot product of a user vector and a movie vector should be high if the user would like the movie.
This is a choice, but it's a logical one.

Multiplying U<sup>T</sup> with M gives a matrix of rating predictions for every user/movie pair.
So we want to take the rating matrix R and decompose it as a product of two factors ("matrix factorization/decomposition").

## Matrix factorization
You get an optimisation problem: choose U and M such that U<sup>T</sup>M ≈ R.

$\begin{aligned}
\argmin_{U,M} \| R - U^T M \|_{F}^{2}
&= \argmin_{U,M} \sum_{i,j} (R_{ij} - \lbrack U^T M \rbrack_{ij})^2 \\
&= \argmin_{U,M} \sum_{i,j} (R_{ij} - u_{i}^{T} m_j)^2
\end{aligned}$

But R is not complete; for most user/movie pairs we don't know the rating.
So it's sometimes better to optimise only over the known ratings:

$\argmin_{U,M} \sum_{i,j \in R_{\text{known}}} (R_{ij} - u_{i}^{T} m_j)^2$

Alternating least squares - an alternative to gradient descent:
- idea: if we know M, computing U is easy, and vice versa
- so, starting with random U and M:
  - fix M, compute a new U
  - fix U, compute a new M

Stochastic gradient descent is usually more practical.

If we only have positive ratings, we have two options:
- ensure that U<sup>T</sup> M always represents probabilities, and maximise the probability of the data.
- sample random user/movie pairs as negative training examples (i.e., assume the user doesn't know those movies)

### Bias control
- control for user bias
  - ratings differ between users, they're subjective (no shit)
  - if we explicitly model the bias of a user, the matrix factorisation only needs to predict how much the user would deviate from their average rating for a particular movie
- control for movie bias
  - some movies are universally hated, some are universally loved
  - unpopular opinion: The Rise of Skywalker wasn't really that bad
- control for temporal bias
  - data is not stable over time
  - e.g. the meaning of specific ratings can change

For user/movie biases, use an additive scalar, which is learned along with the embeddings.
One for each user, one for each movie, and one general bias over all ratings.

For temporal bias, make the biases and embeddings time-dependent,
e.g. cut time into a small number of chunks and learn a separate embedding for each chunk.

### The 'cold start' problem
When a user/movie is added to the database, we have no ratings yet, so matrix factorization can't build an embedding.
We have to rely on implicit feedback and side information.

Using implicit "likes" (e.g. movies watched but not rated, movies browsed, movies hovered over...):
- add a separate movie embedding x and compute a second user embedding
- this is the sum of the x-embeddings of all movies user i has liked
- $u_{i}^{imp} = \sum_{j \in N(i)} x_j$
- add this implicit-feedback embedding to the existing one before computing the dot product

Using side info (age, login, browser resolution...):
- for simplification, assume all features are binary
- assign each feature an embedding, and sum over all features that apply to the user, creating a third user embedding
- $u_{i}^{side} = \sum_{f \in A(i)} y_f$
- add this one as well before computing the dot product
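A rough numpy sketch of this setup: plain SGD over the known ratings only, with the per-user, per-movie and general bias scalars from above. The function name, shapes and hyperparameters are made up for illustration, not a fixed recipe.

```python
import numpy as np

def train_mf(ratings, n_users, n_movies, k=16, lr=0.01, reg=0.1, epochs=20):
    """SGD matrix factorization over the known ratings only.

    ratings: list of (user, movie, rating) triples (the known entries of R).
    Learns user/movie embeddings plus per-user, per-movie and general biases.
    """
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, k))   # user embeddings
    M = rng.normal(scale=0.1, size=(n_movies, k))  # movie embeddings
    b_u = np.zeros(n_users)                        # per-user bias
    b_m = np.zeros(n_movies)                       # per-movie bias
    mu = np.mean([r for _, _, r in ratings])       # general bias over all ratings

    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):  # visit known ratings in random order
            i, j, r = ratings[idx]
            err = r - (mu + b_u[i] + b_m[j] + U[i] @ M[j])
            u_old = U[i].copy()
            # gradient steps on the squared error, with a little L2 regularisation
            U[i] += lr * (err * M[j] - reg * U[i])
            M[j] += lr * (err * u_old - reg * M[j])
            b_u[i] += lr * (err - reg * b_u[i])
            b_m[j] += lr * (err - reg * b_m[j])
    return U, M, b_u, b_m, mu

# a prediction for user i, movie j is then: mu + b_u[i] + b_m[j] + U[i] @ M[j]
```

The implicit-feedback embedding $u_{i}^{imp}$ and the side-information embedding $u_{i}^{side}$ would simply be added to `U[i]` before the dot product.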
## Graph models
Graph convolutional network:
- start with node embeddings N₀
- compute a new embedding for each node: the average of its neighbors' embeddings and its own - AN₀
- multiply by a weight matrix W and apply an activation - N₁ = σ(AN₀W)
- applying this multiple times gives a multi-layered structure

Link prediction: assume the graph is incomplete and try to predict which nodes should be linked.
We can treat this like a matrix factorization problem.
GCN: perform convolutions on the embeddings, multiply them out to generate predictions, compare to the training data, and backpropagate the loss.

Node classification: for each node, try to predict a label.
With node embeddings we can use a regular classifier, but how do we get good embeddings?
GCN for classification: make the embedding size of the last layer equal to the number of classes, apply a softmax activation to the embeddings, and interpret them as probabilities over the classes.

GCN issues:
- depth is a problem: high connectivity diffuses information
- training is usually full-batch; the graph can't easily be broken up into minibatches
- pooling is not selective: all neighbors are mixed equally before the weights are applied

## Validating embedding models
Inductive vs transductive learning: in transduction, the learner is allowed to see the features of all the data, but the labels of only the training data.
Embedding models only support transductive learning; if we don't know about an object until after training, we won't have an embedding vector for it.
Likewise with graph models, we need to know the _whole_ graph.

To evaluate matrix factorization, give the training algorithm all users and movies, but withhold some ratings.
The same goes for links in a graph.
For node classification, give the algorithm the whole graph and the table linking node ids to labels, but withhold some of the labels.
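A rough numpy sketch tying the last two sections together: a two-layer GCN for node classification whose last-layer embedding size equals the number of classes, evaluated transductively by withholding some node labels while the whole graph stays visible. The toy graph, shapes and names are made up for illustration (row-normalised A·N averaging, as described above, rather than any particular published normalisation).

```python
import numpy as np

def normalized_adjacency(adj):
    """A-hat = D^-1 (A + I): average of a node's own embedding and its neighbors'."""
    a = adj + np.eye(adj.shape[0])
    return a / a.sum(axis=1, keepdims=True)

def gcn_layer(a_hat, nodes, weights, activation):
    # one graph convolution: N_{l+1} = sigma(A N_l W)
    return activation(a_hat @ nodes @ weights)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# tiny made-up example: 4 nodes, 3 input features, 2 classes
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 1],
                [0, 1, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
n0 = rng.normal(size=(4, 3))           # initial node embeddings N0
w1 = rng.normal(size=(3, 8)) * 0.1     # layer weights (would be learned)
w2 = rng.normal(size=(8, 2)) * 0.1     # last layer: embedding size = number of classes

a_hat = normalized_adjacency(adj)
n1 = gcn_layer(a_hat, n0, w1, np.tanh)  # N1 = sigma(A N0 W1)
probs = softmax(a_hat @ n1 @ w2)        # class probabilities per node

# transductive split: the whole graph is visible, but only some labels are used
labels = np.array([0, 1, 1, 0])
train_mask = np.array([True, True, False, False])  # withheld labels form the test set
loss = -np.mean(np.log(probs[train_mask, labels[train_mask]]))
# training would backpropagate this loss into w1 and w2; evaluation compares
# probs[~train_mask].argmax(axis=1) against the withheld labels[~train_mask]
```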